Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches

ABSTRACT

A multithreaded processor, fetch control for a multithreaded processor and a method of fetching in the multithreaded processor. Processor event and use (EU) signals are monitored for downstream pipeline conditions indicating pipeline execution thread states. Instruction cache fetches are skipped for any thread that is incapable of receiving fetched cache contents, e.g., because the thread is full or stalled. Also, consecutive fetches may be selected for the same thread, e.g., on a branch mis-predict. Thus, the processor avoids wasting power on unnecessary or place keeper fetches.

CROSS REFERENCE TO RELATED APPLICATION

The present invention is a divisional of U.S. patent application Ser. No. 11/928,686, (Attorney docket No. YOR920040167US3) entitled “ADAPTIVE FETCH GATING IN MULTITHREADED PROCESSORS, FETCH CONTROL AND METHOD OF CONTROLLING FETCHES” to Pradip Bose et al., filed Oct. 30, 2007, and a continuation of allowed U.S. patent application Ser. No. 11/228,781, (Attorney docket No. YOR920040167US2) entitled “ADAPTIVE FETCH GATING IN MULTITHREADED PROCESSORS, FETCH CONTROL AND METHOD OF CONTROLLING FETCHES” to Pradip Bose et al., filed Sep. 16, 2005, which is a continuation of U.S. Provisional Patent Application Ser. No. 60/610,990, entitled “System And Method For Adaptive Fetch Gating” to Pradip Bose et al., filed Sep. 17, 2004, both of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to multi-threaded processors and more particularly to reducing power consumption in a Simultaneous MultiThreaded (SMT) processor or microprocessor.

2. Background Description

Semiconductor technology and chip manufacturing advances have resulted in a steady increase of on-chip clock frequencies, the number of transistors on a single chip and the die size itself. Thus, notwithstanding the decrease of chip supply voltage, chip power consumption has increased as well. Both at the chip and system levels cooling and packaging costs have escalated as a natural result of this increase in chip power. At the low end for small systems (e.g., handhelds, portable and mobile systems), where battery life is crucial, it is important to reduce net power consumption, without having performance degrade to unacceptable levels. Thus, the increase in microprocessor power consumption has become a major stumbling block for future performance gains. Pipelining is one approach to maximizing processor performance.

A scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands. Each such operand is a single or atomic data value or number. Pipelining within a scalar processor introduces what is known as concurrency, i.e., processing multiple instructions at different pipeline stages in a given clock cycle, while preserving the single-issue paradigm.

A superscalar processor can fetch, issue and execute multiple instructions in a given machine cycle, each in a different execution path or thread. Each instruction fetch, issue and execute path is usually pipelined for further, parallel concurrency. Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium processor family from Intel Corporation, the Ultrasparc processors from Sun Microsystems and the Alpha processor and PA-RISC processors from Hewlett Packard Company (HP). Front-end instruction delivery (fetch and dispatch/issue) accounts for a significant fraction of the energy consumed in a typical state of the art dynamic superscalar processor. For high-performance processors, such as IBM's POWER4™, the processor consumes a significant portion of chip power in the instruction cache (ICACHE) during normal access and fetch processes. Of course, when the fetch process stalls, temporarily (e.g., due to instruction buffer fill-up, or cache misses), that portion of chip power falls off dramatically, provided the fetch process is stalled also.

Unfortunately, other factors (e.g., chip testability, real estate, yield) tend to force a trade of power for control simplification. So, in prior generation power-unaware designs, one may commonly find processors architected to routinely access the ICACHE on each cycle, even when the fetched results may be discarded, e.g., due to stall conditions. Buffers and queues in such processor designs have fixed sizes, and depending on the implementation, consume power at a fixed rate, irrespective of actual cache utilization or workload demand. For example, for a typical state of the art instruction fetch unit (IFU) in a typical state of the art eight-issue superscalar processor, executing a class of commercial benchmark applications, only about 27% of the cycles result in useful fetch activity. Similarly, idle and stalled resources of a front-end instruction decode unit (IDU) pipe waste significant power. Further, this front-end starvation keeps back-end execute pipes even more underutilized, which impacts processor throughput.

By contrast, in what is known as an energy-aware design, the fetch and/or issue stages are architected to be adaptive, to accommodate workload demand variations. These energy-aware designs adjust the fetch and/or issue resources to save power without appreciable performance loss. For example, Buyuktosunoglu et al. (Buyuktosunoglu I), “Energy efficient co-adaptive instruction fetch and issue,” Proc. Int'l. Symp. on Computer Architecture (ISCA), June 2003 and Buyuktosunoglu et al. (Buyuktosunoglu II), “Tradeoffs in power-efficient issue queue design,” Proc. ISLPED, August 2002, both discuss such energy aware designs. In particular, Buyuktosunoglu I and II focus on reconfiguring the size of issue queues, in conjunction (optionally) with an adjustable instruction fetch rate. In another example, Manne et al., “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Int'l. Symp. on Computer Architecture (ISCA), 1998, teaches using the processor branch mis-prediction rate in the instruction fetch to effectively control the fetch rate for power and efficiency. Unfortunately, monitoring the branch prediction accuracy requires additional, significant and complex on-chip hardware that consumes both valuable chip area and power.

This problem is exacerbated in multithreaded machines, where multiple instruction threads may, or may not be in the pipeline at any one time. For example, Karkhanis et al., “Saving energy with just-in-time instruction delivery,” Proc. Int'l. Symp. on Low Power Electronics and Design (ISLPED), August 2002, teach controlling instruction fetch rate by keeping a count of valid, downstream instructions. Both U.S. Pat. No. 6,212,544 to Borkenhagen et al. (Borkenhagen I), entitled “Altering thread priorities in a multithreaded processor,” and U.S. Pat. No. 6,567,839 to Borkenhagen et al. (Borkenhagen II), “Thread switch control in a multithreaded processor system,” both assigned to the assignee of the present invention and incorporated herein by reference, teach designing efficient thread scheduling control for boosting performance and/or reducing power in multithreaded processors. In yet another example, Seng et al. “Power-Sensitive Multithreaded Architecture,” Proc. Int'l. Conf. on Computer Design (ICCD) 2000, teaches an energy-aware multithreading design.

State of the art commercial microprocessors (e.g. Intel's Netburst™ Pentium™ IV or IBM's POWER5™) use a mode of multithreading that is commonly referred to as Simultaneous MultiThreading (SMT). In each processor cycle, a SMT processor simultaneously fetches and/or dispatches instructions for different threads that populate the back-end execution resources. Fetch gating in an SMT processor refers to conditionally blocking the instruction fetch process. Thread prioritization involves assigning priorities in the order of fetching instructions from a mix of different workloads in a multi-threaded processor. Some of the above energy-aware design approaches have been applied to SMT. For example, Luo et al. “Boosting SMT Performance by Speculation Control,” Proc. Int'l. Parallel and Distributed Processing Symposium (IPDPS), 2001, teaches improving performance in energy-aware SMT processor design. Moursy et al. “Front-End Policies for Improved Issue Efficiency in SMT Processors,” Proc. HPCA 2003, focuses on reducing the average power consumption in SMT processors by sacrificing some performance. By contrast, Knijnenburg et al. “Branch Classification for SMT Fetch Gating,” Proc. MTEAC 2002 focuses on increasing performance without regard to complexity. These energy aware approaches require complex variable instruction fetch rate mechanisms and control signals necessitating significant additional logic hardware. The additional logic hardware dynamically calculates complex utilization, prediction rates and/or flow rate metrics within the processor or system. However, the verification logic of such control algorithms adds overhead in complexity, area and power, that is not amenable to a low cost, easy implementation for high performance chip designs. This overhead just adds to both escalating development costs and spiraling power dissipation costs.

Unfortunately, many of these approaches have achieved improved performance only at the cost of increased processor power consumption. Others have reduced power consumption (or at least net energy usage) by accepting significantly degraded performance. Still others have accepted complex variable instruction fetch rate mechanisms that necessitate significant additional logic hardware.

Thus, there is a need for a processor architecture that minimizes power consumption without impairing processor performance and without requiring significant control logic overhead or power.

SUMMARY OF THE INVENTION

It is therefore a purpose of the invention to minimize processor power consumption;

It is another purpose of the invention to minimize Simultaneous MultiThreaded (SMT) processor power consumption;

It is yet another purpose of the invention to minimize SMT processor power consumption without incurring significant performance or area overhead.

The present invention relates to a multithreaded processor, fetch control for a multithreaded processor and a method of fetching in the multithreaded processor. Processor event and use (EU) signals are monitored for downstream pipeline conditions indicating pipeline execution thread states. Instruction cache fetches are skipped for any thread that is incapable of receiving fetched cache contents, e.g., because the thread is full or stalled. Also, consecutive fetches may be selected for the same thread, e.g., on a branch mis-predict. Thus, the processor avoids wasting power on unnecessary or place keeper fetches.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:

FIG. 1 shows a general example of Simultaneous MultiThreaded (SMT) architecture wherein the front end of a state of the art SMT processor is optimized for minimum power consumption without impacting performance or area according to a preferred embodiment of the present invention;

FIG. 2 shows a block diagram of a more specific example of a preferred embodiment SMT processor that supports two threads in this example;

FIGS. 3A-B show an example of the preferred fetch control, which determines on each cycle, whether a fetch from the ICACHE occurs, based on the current state of thread monitor and control flags;

FIGS. 4A-B show examples of state diagrams for the preferred embodiment fetch control from thread monitor and control flags.

DESCRIPTION OF PREFERRED EMBODIMENTS

Turning now to the drawings, and more particularly, FIG. 1 shows a general example of Simultaneous MultiThreaded (SMT) architecture wherein the front end of a state of the art SMT processor 100 is optimized for minimum power consumption without impacting performance or area, according to a preferred embodiment of the present invention. The SMT processor 100, which may be a single chip or multi-chip microprocessor, includes an instruction cache (ICACHE) 102 with a number of tasks or applications in cache contents from which to select/fetch. The ICACHE 102 provides cached instructions for R threads that originate from one of R ports 104-1, 104-2, - - - 104-R. Preferred embodiment priority thread selection logic 106 selectively fetches and passes the contents of each of ports 104-1, 104-2, - - - 104-R to an Instruction Fetch Unit (IFU) pipeline 108. Each of the R ports 104-1, 104-2, - - - 104-R has a fixed maximum fetch bandwidth to the IFU pipeline 108 of a number of instructions per cycle. Thus, the preferred embodiment priority thread selection logic 106 may pass the contents from each port 104-1, 104-2, - - - 104-R at a rate up to that maximum with the overall bandwidth being R times that maximum.

The IFU 108 passes instructions into T front-end Instruction BUFfers (IBUF), 110-1, 110-2, - - - 110-T, one for each supported machine execution thread. The preferred embodiment priority thread selection logic 106 also receives Event and Use (EU) signals or flags to control fetch and thread selection for the fetch process, determine target instruction buffer threads in instruction buffers 110-1, 110-2, - - - 110-T, as well as order within the threads and the number of instructions fetched, if any, for a given thread. Instructions in each instruction buffer 110-1, 110-2, - - - 110-T pass through a corresponding decode and dispatch unit, 112-1, 112-2, - - - 112-T and, subsequently, emerge under control of dispatch-thread priority logic 114. The dispatch-thread priority logic 114 selects instructions from various different threads and multiplexes the selected instructions as an input to a common dispatch buffer 116. This dispatch buffer 116 issues instructions into the back-end execution pipes (not shown in this example).

It may be shown that, absent preferred embodiment fetch control, within an average processor cycle window, the front-end fetch engine of this SMT processor 100 example accesses the ICACHE 102 much more frequently than necessary and uses the instruction buffers, 110-1, 110-2, - - - 110-T, much more than necessary. Thus, the preferred embodiment fetch control balances the power-performance of the front-end fetch engine of this SMT processor 100 for dramatically improved efficiency.

FIG. 2 shows a block diagram of a more specific example of a preferred embodiment SMT processor 120, supporting two threads in this example. The ICACHE 122 has a single read port 124 to preferred fetch control 126. The preferred fetch control 126 selectively fetches instructions and forwards fetched instructions to front end pipeline stages 128. So, instructions exiting the front end pipeline stages 128 pass through multiplexor/demultiplexor (mux/demux) 132 and enter an Instruction BUFfer (IBUF) in one of two threads, 134-0, 134-1 of this example. Each thread passes through a number of buffer pipeline stages 136-0, 136-1, eventually emerging from an Instruction Register (IR) 138-0, 138-1. A multiplexer 140 selects a mix of instructions from the contents of the instruction registers 138-0, 138-1 to back end processor logic (not shown), e.g., to a dispatch group for back end execution. An Instruction Fetch Address Register (IFAR) 142-0, 142-1 addresses each fetched instruction.

Thread monitor and control flags 144, 146, 148, 150 determine in each clock cycle whether the preferred fetch control 126 forwards an instruction from the ICACHE 122, that is identified by one of the instruction fetch address registers 142-0, 142-1. In this example, the thread monitor and control flags include stall event flags (e.g., branch mis-predicts, cache misses, etc.) 144, flow rate mismatch flags 146, utilization flags 148 and, optionally, thread priority flags 150. The utilization flags 148 may include individual instruction buffer high water mark controls 148-0, 148-1 that also operate to stall corresponding instruction buffers 134-0, 134-1, whenever a respective thread pipeline is full to its respective high water mark. Although the utilization flags 148-0 and 148-1 are indicated herein as two flags, each having to do with the instruction buffers 134-0, 134-1, this is for example only. Multiple utilization flags may be included as downstream utilization markers. For example, a high watermark may be provided for various other downstream queues, e.g., in the execution back-end of the machine, that may provide additional or alternate inputs to the preferred fetch control 126.

However, for any particular cycle in the example of FIG. 2, when a fetch is enabled, the address in the instruction fetch address register, 142-0, 142-1 may simply be incremented from the previous cycle, e.g., by an incrementer 152-0, 152-1. Alternately, the address may be loaded from next fetch address logic 154-0, 154-1, e.g., in response to a branch. So, for example, the next address may depend upon an interrupt, a branch instruction or Branch History Table/Branch Target Buffer (BHT/BTB) contents. Further, the next fetch address logic 154-0, 154-1 may be implemented using any suitable such fetch address logic to generate the next cache address as may be appropriate for the particular application.
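The address selection described above can be illustrated with a minimal sketch. It is not taken from the specification: the function name, the fetch width, and the choice to hold the address while fetch is gated are all assumptions made only for illustration.

```python
from typing import Optional

# Assumed fetch granularity in bytes; the text does not give a value.
FETCH_WIDTH_BYTES = 16


def next_fetch_address(ifar: int, fetch_enabled: bool,
                       redirect_target: Optional[int] = None) -> int:
    """Return the next value of one thread's instruction fetch address register.

    redirect_target stands in for the output of the next fetch address logic
    (e.g., a branch target); None means no redirect is pending.
    """
    if not fetch_enabled:
        # Assumption: a gated fetch leaves the address unchanged.
        return ifar
    if redirect_target is not None:
        # Address loaded from the next fetch address logic.
        return redirect_target
    # Sequential case: the incrementer path.
    return ifar + FETCH_WIDTH_BYTES
```

In this sketch a redirect takes precedence over the incrementer, which matches the "alternately" wording above but is otherwise an illustrative choice.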

The preferred fetch control 126 infers thread stall states, cycle-by-cycle, from the stall flags 144 indicating selected stall events, e.g., branch mis-prediction, cache miss, and dispatch stall. These stall event flags 144 are often routinely tracked on-chip in state of the art processors, e.g., using performance counters, or as part of other book-keeping and stall management. However, in accordance with a preferred embodiment of the present invention, the stall flags 144 are invoked as override conditions to prevent/enable fetch-gating for a stalled thread, or to redirect fetches for another thread. Also, when a branch mis-prediction occurs in a given thread, the thread contents are invalid. The preferred fetch control 126 gives that thread priority and allows uninhibited fetches at full bandwidth to fill up pipeline slots in the thread that are vacated by flushed instructions.

Downstream utilization state flags 148 provide a set of high watermark indicators that the preferred fetch control 126 monitors for developing path criticalities. Thus, each high watermark flag 148, when asserted, indicates that a particular queue or buffer resource is almost full. Depending on whether a thread-specific resource or a shared resource is filling, a thread selection and prioritization policy may be defined in the preferred fetch control 126 and dynamically adjusted to indicate when any particular resources are at or near capacity. Upon such an occurrence, the preferred fetch control 126 may invoke fetch-gating based on the falloff of downstream demand to save energy whenever possible.
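As a rough illustration of how a high watermark flag might be derived and consulted, consider the following sketch. The threshold fraction, the function names, and the per-thread occupancy dictionary are hypothetical; the specification does not state how the flags are generated.

```python
from typing import Dict, List

# Assumed threshold: flag a buffer once it is three quarters full.
HIGH_WATERMARK_FRACTION = 0.75


def high_watermark_flag(occupancy: int, capacity: int,
                        fraction: float = HIGH_WATERMARK_FRACTION) -> bool:
    """Asserted when a queue or buffer is at or above its high watermark."""
    return occupancy >= fraction * capacity


def threads_able_to_receive(occupancy_by_thread: Dict[int, int],
                            capacity: int) -> List[int]:
    """Thread IDs whose instruction buffer is still below its high watermark."""
    return [thread for thread, occupancy in sorted(occupancy_by_thread.items())
            if not high_watermark_flag(occupancy, capacity)]
```

With both buffers at or above the mark the returned list is empty, which corresponds to the case where fetch-gating may be invoked.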

FIGS. 3A-B show examples of inputs and output control to the preferred fetch control 126 for determining on each cycle, whether a fetch from the ICACHE 122 occurs based on the current state of thread monitor and control flags 144, 146, 148, 150, collectively, 160 in this example. Preferably, the fetch control logic 126 is a simple finite state machine that monitors a small subset of processor utilization indicators, e.g., stall state and last thread identifier. Thus, thread monitor and control flags 160 may include, for example, a branch mis-prediction indicator, a cache miss indicator, an execution pipeline stall indicator, a dependence-related dispatch stall indicator, a resource-conflict stall indicator, and a pipeline flush-and-replay stall indicator. The fetch control logic 126 may include a finite state controller with two outputs, a fetch_gate 162 and a next_thread_id indicator 164. The fetch_gate 162 is a Boolean flag that is asserted whenever gating the instruction fetch is deemed to be desirable. The next_thread_id indicator 164 points to the thread for fetching in the next cycle. A miss/stall latch 166 holds the last fetch identification and latches the current thread fetch identification to facilitate determining, in each fetch cycle, the next thread fetch identification. A fetch gate output enables gating the contents of the ICACHE (122 in FIG. 2) as selected by the corresponding fetch address register (142-0, 142-1). The inverse of the fetch gate 162, inverted by inverter 168 in this example, combines with a dispatch stall signal 170 in an AND gate 172 to provide a flow rate indicator as a flow mismatch flag 146 in FIG. 2.
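The two controller outputs and the inverter/AND combination described above can be written out as a short sketch. The class and function names are illustrative only; the logic follows the text, in which the flow mismatch flag is the AND of the inverted fetch gate and the dispatch stall signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchControlOutputs:
    """The two outputs named in the text (container type is an assumption)."""
    fetch_gate: bool               # asserted when the fetch should be gated
    next_thread_id: Optional[int]  # None models an undefined next thread


def flow_mismatch_flag(fetch_gate: bool, dispatch_stall: bool) -> bool:
    """Inverter 168 feeding AND gate 172: (NOT fetch_gate) AND dispatch_stall."""
    return (not fetch_gate) and dispatch_stall
```

The flag is therefore raised only when fetching continues (the gate is not asserted) while dispatch is stalled, i.e., when the front end is supplying instructions faster than the back end is consuming them.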

FIGS. 4A-B show examples of state diagrams for the preferred embodiment fetch control 126 of FIGS. 2 and 3A from thread monitor and control flags 160. In step 1460 of FIG. 4A, the flags 160 are checked for an indication of a flow rate mismatch. If a flow rate mismatch is not indicated, then in 1462, the flags 160 are checked for an indication that a branch mis-prediction has occurred. If the flags 160 do not indicate a branch mis-prediction either, then in 1464 the next ICACHE fetch is for a thread that is different than the last. However, if it is determined in 1460 that a flow rate mismatch has occurred, then in 1466 the flags 160 are checked for a Data/Instruction (D/I) cache miss. If a D/I cache miss has not occurred, then in 1468, the flags 160 are checked for an indication that a branch mis-prediction has occurred. If the flags 160 indicate that a branch mis-prediction has occurred in either 1462 or 1468, then in 1470, a determination is made of which thread, e.g., thread 0, thread 1, or both in this example. If in 1470 the mis-prediction indication is: thread 0, then in 1472, the next thread ID is set to indicate thread 0; thread 1, then in 1474, the next thread ID is set to indicate thread 1; otherwise, both threads are indicated and in 1476 the next thread ID is set to indicate that it is undefined. Also, if branch mis-prediction is determined not to have occurred in 1468, then, the next thread ID is undefined in 1476. Since the next thread ID is undefined in 1476, the fetch gate should be enabled, and nothing should be fetched from either thread in the next cycle. If it is determined that a D/I cache miss has occurred in 1466, then in 1478, a determination is made of which thread, e.g., thread 0, thread 1, or both in this example. A determination of either thread 0, or thread 1, results in an opposite indication of determination 1470, i.e., the next thread ID is set to indicate the other thread.
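The FIG. 4A decisions described above may be easier to follow as a small sketch for the two-thread example. This is one reading of the text, not the patented implementation: the function names and the use of sets of thread IDs are assumptions, and the outcome when both threads report a D/I cache miss in 1478 is not stated in the text, so gating is assumed for that case.

```python
from typing import Optional, Set, Tuple

Decision = Tuple[bool, Optional[int]]  # (fetch_gate, next_thread_id)


def _mispredict_target(threads: Set[int]) -> Decision:
    """Step 1470: fetch for the single mis-predicting thread, else gate."""
    if len(threads) == 1:
        (thread,) = threads
        return False, thread   # 1472 or 1474
    return True, None          # 1476: both threads, next thread undefined


def fig4a_next_fetch(flow_mismatch: bool, cache_miss_threads: Set[int],
                     mispredict_threads: Set[int],
                     last_thread: int) -> Decision:
    """Two-thread sketch of FIG. 4A; thread IDs are assumed to be 0 and 1."""
    if not flow_mismatch:                          # 1460
        if not mispredict_threads:                 # 1462
            return False, 1 - last_thread          # 1464: alternate threads
        return _mispredict_target(mispredict_threads)
    if cache_miss_threads:                         # 1466
        if len(cache_miss_threads) == 1:           # 1478
            (missing,) = cache_miss_threads
            return False, 1 - missing              # fetch for the other thread
        return True, None                          # assumption: both missed
    if mispredict_threads:                         # 1468
        return _mispredict_target(mispredict_threads)
    return True, None                              # 1476
```

For instance, with a flow rate mismatch and a cache miss only in thread 0, the sketch returns (False, 1), so the next fetch is directed to thread 1.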

Similarly, in FIG. 4B, the flags 160 are checked for an indication that the high water mark for one of the instruction buffers is above a selected threshold. So, for the example of FIG. 2, in 1480, the high water mark is checked for instruction buffer 0. Depending on the results of that check, the high water mark is checked for instruction buffer 1 in 1482 if the high water mark for instruction buffer 0 is at or above that threshold, or in 1484 if the high water mark for instruction buffer 0 is below the threshold. If in 1482, the high water mark for instruction buffer 1 is below the threshold; then, in 1486 the flags 160 are checked for an indication that a branch mis-prediction has occurred. If a branch mis-prediction has not occurred, then in 1488 the next thread ID is set to indicate that it is undefined; and, simultaneously, the previous thread ID is held (e.g., in the miss/stall latch 166 of FIG. 3A) and the fetch gate is asserted. Similarly, in 1484 if the high water mark for instruction buffer 1 is at or above the threshold; then, in 1490 the flags 160 are checked for an indication that a branch mis-prediction has occurred. If in either 1486 or 1490, a branch mis-prediction is found to have occurred; then in 1492, a determination is made of which thread, again, thread 0, thread 1, or both in this example. If in 1492 the mis-prediction indication is: thread 0, then in 1494, the next thread ID is set to indicate thread 0; thread 1, then in 1496, the next thread ID is set to indicate thread 1; otherwise, both threads are indicated and in 1498 the next ICACHE fetch is for a thread that is different than the last. If in 1482, the high water mark for instruction buffer 1 was found at or above the threshold, the next thread ID is set to indicate thread 1 in 1496. If in 1484, the high water mark for instruction buffer 1 was found below the threshold, the next thread ID is set to indicate thread 0 in 1494. Finally, if a branch mis-prediction is found to have occurred in 1490; then, the next ICACHE fetch is for a thread that is different than the last in 1496. Thus, using fetch control according to the present invention provides simple, effective adaptive fetch-gating for front-end thread selection and priority logic for significant performance gain, and with simultaneous front-end power reduction.

Advantageously, the thread monitor and control flags 144, 146, 148, 150 of FIG. 2 provide a simple indication of a processor state from which cache gating controls are derived to prevent unnecessary or superfluous instruction cache fetches or accesses. Accordingly, the preferred embodiment adaptive fetch-gating infers gating control from a typical set of (normally found in state of the art processor architectures) queue markers and event flags, and/or flags that are added or supplemented with insignificant area and timing overhead. Further, the present invention has application to SMT processors, generally, where adaptive fetch gating may be combined naturally with an implicit set of power-aware thread prioritization heuristics. For single-threaded processing, application of the invention naturally reduces to simple, adaptive fetch gating. Additionally, the preferred fetch gating has application on a cycle-by-cycle basis to determining whether each fetch should proceed, and if so, from which of a number of available threads. In yet another advantage, application of the invention to typical state of the art processors significantly improves processor throughput performance, while reducing the number of actual cache accesses and, therefore, dramatically reducing energy consumption. The energy consumption reduction from application of the present invention may far exceed the reduction in execution time, thereby providing an overall average power dissipation reduction as well.

While the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims. It is intended that all such variations and modifications fall within the scope of the appended claims. Examples and drawings are, accordingly, to be regarded as illustrative rather than restrictive.

1. A multithreaded processor comprising: an instruction cache with a plurality of cache locations; a thread selection and priority circuit configured to monitor processor flags and selectively retrieve contents of each of said plurality of cache locations; an instruction fetch unit pipeline configured to receive the selectively retrieved contents from said thread selection and priority circuit; and a plurality of instruction buffer threads, wherein each of the selectively retrieved contents are passed to one of the instruction buffer threads through said instruction fetch unit pipeline, the thread selection and priority circuit being further configured to retrieve contents only for threads indicated by the processor flags as being capable of receiving the selectively retrieved contents.
2. A multithreaded processor as in claim 1, wherein said processor flags include pipeline stall flags, flow mismatch flags, and utilization flags.
3. A multithreaded processor as in claim 2, wherein said processor flags further include thread priority flags.
4. A multithreaded processor as in claim 1, wherein said thread selection and priority circuit comprises an instruction cache fetch control circuit receiving said processor flags and determining whether contents are fetched from said instruction cache and further selecting instruction cache contents being fetched.
5. A multithreaded processor as in claim 4, wherein said instruction cache fetch control circuit is a state machine.
6. A multithreaded processor as in claim 5, wherein said state machine comprises: means for determining flow rate mismatch in said pipeline from said flags; means for determining Data/Instruction (D/I) cache misses, responsive to a flow rate mismatch; means for determining a branch mis-prediction responsive to said flags; means for determining a next thread responsive to said means for flow rate mismatch determination and said means for D/I cache miss determination; and means for indicating a next thread.
7. A multithreaded processor as in claim 5, wherein said state machine comprises: means for determining whether each thread is at or above a high water mark; means for determining a mis-prediction responsive to said flags; means for determining a next thread responsive to said means for determining a mis-prediction; and means for indicating a next thread.
8. A multithreaded processor as in claim 4, wherein said instruction cache fetch control circuit provides a fetch gate signal and a thread identification to said cache responsive to said flags.
9. A multithreaded processor as in claim 8, wherein said fetch gate signal is combined with a dispatch stall signal, a flow rate mismatch flag being provided from the combination.
10. A multithreaded processor as in claim 1, wherein said multithreaded processor is a Simultaneous MultiThreaded (SMT) processor.
11. An instruction fetch controller connectable between an instruction cache and a plurality of instruction buffers, said instruction fetch controller comprising: one or more inputs, each connected to an instruction cache port, each input receiving instructions from one or more threads stored in respective instruction cache banks; one or more instruction outputs, each selectively providing one or more fetched instructions to a corresponding instruction buffer; and one or more control inputs providing receiving event and use (EU) signals, said EU signals selecting in any clock cycle whether an instruction is fetched from the instruction cache.
12. An instruction fetch controller as in claim 11, where the EU signals further controlling fetch selection priority and the instruction fetch controller selects which instructions are fetched in each clock cycle responsive to said EU signals whenever instructions are fetched.
13. An instruction fetch controller as in claim 11, where the EU signals are selected from the group comprising: a signal indicating level of a queue, a stall event indicator, a thread priority indicator, an information flow rate indicator, a status flag, a pipeline stall condition indicator, a logical input indicator, a function input indicator, statistical indication signal, a historical state signal, and a state signal.
14. An instruction fetch controller as in claim 13, wherein the stall event indicator is selected from the group comprising: a branch mis-prediction indicator, a cache miss indicator, an execution pipeline stall indicator, a dependence-related dispatch stall indicator, a resource-conflict stall indicator, and a pipeline flush-and-replay stall indicator.
15. An instruction fetch controller as in claim 13, wherein the signal indicating level of the queue is a high watermark indicator for a buffer selected from the group comprising: an instruction fetch buffer, a load buffer, a store buffer, and an issue buffer.
16. An instruction fetch controller as in claim 11, wherein one or more of the EU signals are dispatch stage thread priority signals.
17. An instruction fetch controller as in claim 16, wherein the dispatch stage thread priority signals are asserted by software.
18. An instruction fetch controller as in claim 17, wherein the dispatch stage thread priority signals are hardware generated signals.
19. An instruction fetch controller as in claim 16, wherein the dispatch stage thread priority signals indicate an encoding order for considering threads in selecting cache contents and dispatching selected contents for execution in a given cycle.
20. A Simultaneous MultiThreaded (SMT) processor comprising: an instruction cache with a plurality of cache locations; an instruction fetch unit pipeline configured to receive selectively retrieved cache contents; a plurality of instruction buffers, each selectively receiving data and instructions from said instruction fetch unit pipeline, wherein said instruction fetch unit pipeline is between said instruction cache and said plurality of instruction buffers, cache contents received by each of said plurality of instruction buffers passing through said instruction fetch unit pipeline to a respective instruction buffer; a plurality of instruction buffer threads, each of said plurality of instruction buffer threads traversing one of said plurality of instruction buffers; and a thread selection and priority circuit monitoring event and use (EU) flag signals for an indication of ones of said plurality of instruction buffers being capable of receiving retrieved cache contents, said thread selection and priority circuit selecting cache content locations being fetched and retrieving said cache contents from selected said cache content locations for a thread traversing an indicated one.
21. A SMT processor as in claim 20, wherein said thread selection and priority circuit comprises an instruction cache fetch control circuit receiving said EU flags and determining whether contents are fetched from said instruction cache and further selecting instruction said cache content locations.
22. A SMT processor as in claim 21, wherein said instruction cache fetch control circuit is a state machine comprising: means for determining flow rate mismatch in said pipeline from said EU flags; means for determining Data/Instruction (D/I) cache misses, responsive to a flow rate mismatch; means for determining a branch mis-prediction responsive to said EU flags; means for determining a next thread responsive to said means for flow rate mismatch determination and said means for D/I cache miss determination; and means for indicating a next thread.
23. A SMT processor as in claim 21, wherein said instruction cache fetch control circuit is a state machine comprising: means for determining whether each thread is at or above a high water mark; means for determining a mis-prediction responsive to said EU flags; means for determining a next thread responsive to said means for determining a mis-prediction; and means for indicating a next thread.
24. A SMT processor as in claim 21, wherein said instruction cache fetch control circuit provides a fetch gate signal and a thread identification to said cache responsive to said EU flags.
25. A SMT processor as in claim 24, wherein said fetch gate signal is combined with a dispatch stall signal, a flow rate mismatch flag being provided from the combination.
26. A SMT processor as in claim 21, wherein said EU flags include pipeline stall flags, flow mismatch flags, utilization flags and thread priority flags.